221 research outputs found

    Social Media Variety Geolocation with geoBERT

    This paper describes the Helsinki-Ljubljana contribution to the VarDial 2021 shared task on social media variety geolocation. Following our successful participation at VarDial 2020, we again propose constrained and unconstrained systems based on the BERT architecture. In this paper, we report experiments with different tokenization settings and different pre-trained models, and we contrast our parameter-free regression approach with various classification schemes proposed by other participants at VarDial 2020. Both the code and the best-performing pre-trained models are made freely available.
    Peer reviewed

    Neural morphosyntactic tagging for Rusyn

    The paper presents experiments on part-of-speech and full morphological tagging of the Slavic minority language Rusyn. The proposed approach relies on transfer learning and uses only annotated resources from related Slavic languages, namely Russian, Ukrainian, Slovak, Polish, and Czech. It requires neither annotated Rusyn training data nor parallel data or bilingual dictionaries involving Rusyn. Compared to earlier work, we improve tagging performance by using a neural network tagger and larger training data from the neighboring Slavic languages. We experiment with various data preprocessing and sampling strategies and evaluate the impact of multitask learning strategies and of pretrained word embeddings. Overall, while genre discrepancies between training and test data have a negative impact, we improve full morphological tagging by 9% absolute micro-averaged F1 compared to previous research.
    Peer reviewed

    Measuring Semantic Abstraction of Multilingual NMT with Paraphrase Recognition and Generation Tasks

    In this paper, we investigate whether multilingual neural translation models learn stronger semantic abstractions of sentences than bilingual ones. We test this hypothesis by measuring the perplexity of such models when applied to paraphrases of the source language. The intuition is that an encoder produces better representations if a decoder is capable of recognizing synonymous sentences in the same language even though the model is never trained for that task. In our setup, we add 16 different auxiliary languages to a bidirectional bilingual baseline model (English-French) and test it with in-domain and out-of-domain paraphrases in English. The results show that the perplexity is significantly reduced in each of the cases, indicating that meaning can be grounded in translation. This is further supported by a study on paraphrase generation that we also include at the end of the paper.
    Peer reviewed
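
    As a rough illustration of the perplexity measure mentioned in this abstract, the sketch below computes perplexity from per-token log-probabilities that a model assigns to a paraphrase. The function name and the example scores are hypothetical and not taken from the paper; the sketch assumes natural-log probabilities are already available from some translation model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one sentence from natural-log token probabilities.

    Defined here as exp of the average negative log-likelihood per token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical scores for the same paraphrase under two models
# (illustrative numbers only, not results from the paper).
bilingual_scores = [math.log(0.25)] * 4
multilingual_scores = [math.log(0.5)] * 4

if __name__ == "__main__":
    print(perplexity(bilingual_scores))
    print(perplexity(multilingual_scores))
```

    Under this definition, a lower value means the model finds the paraphrase less surprising, which is the direction of improvement the abstract reports.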

    Neural Machine Translation with Extended Context

    Peer reviewed

    OcWikiDisc : a Corpus of Wikipedia Talk Pages in Occitan

    This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is based on language identification experiments with five off-the-shelf tools, including the new fastText language identification model from Meta AI's No Language Left Behind initiative, released in July 2022.
    Peer reviewed

    New Developments in Tagging Pre-modern Orthodox Slavic Texts

    Pre-modern Orthodox Slavic texts pose certain difficulties when it comes to part-of-speech and full morphological tagging. Orthographic and morphological heterogeneity makes it hard to apply resources that rely on normalized data, which is why previous attempts to train part-of-speech (POS) taggers for pre-modern Slavic often apply normalization routines. In the current paper, we further explore the normalization path; at the same time, we use the statistical CRF-tagger MarMoT and a newly developed neural network tagger that cope better with variation than previously applied rule-based or statistical taggers. Furthermore, we conduct transfer experiments to apply Modern Russian resources to pre-modern data. Our experiments show that while transfer experiments could not improve tagging performance significantly, state-of-the-art taggers reach between 90% and more than 95% tagging accuracy and thus approach the tagging accuracy of modern standard languages with rich morphology. Remarkably, these results are achieved without the need for normalization, which makes our research of practical relevance to the Paleoslavistic community.
    Peer reviewed

    Natural language processing for similar languages, varieties, and dialects: A survey

    There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim of improving the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with a focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.
    Non peer reviewed

    Low Saxon dialect distances at the orthographic and syntactic level

    We compare five Low Saxon dialects from the 19th and 21st centuries from Germany and the Netherlands with each other as well as with modern Standard Dutch and Standard German. Our comparison is based on character n-grams on the one hand and PoS n-grams on the other, and we show that these two lead to different distances. Particularly in the PoS-based distances, one can observe all of the 21st-century Low Saxon dialects shifting towards the modern majority languages.
    Peer reviewed
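
    To make the idea of n-gram-based distances concrete, here is a minimal sketch that builds n-gram frequency profiles over either characters or PoS tags and compares them. The abstract does not say which distance measure the authors use; cosine distance is an assumption chosen for illustration, and the function names and toy inputs are hypothetical.

```python
import math
from collections import Counter


def ngram_profile(seq, n=3):
    """Count n-grams over a string (characters) or a list (e.g. PoS tags)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def cosine_distance(a, b):
    """1 - cosine similarity of two n-gram count profiles (assumed metric)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty profile shares nothing with any other profile
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy inputs only; real profiles would come from whole corpora.
    chars_a = ngram_profile("abcabcabd", n=3)
    chars_b = ngram_profile("abcabdabd", n=3)
    tags_a = ngram_profile(["DET", "NOUN", "VERB", "DET", "NOUN"], n=2)
    tags_b = ngram_profile(["DET", "ADJ", "NOUN", "VERB", "NOUN"], n=2)
    print(cosine_distance(chars_a, chars_b))
    print(cosine_distance(tags_a, tags_b))
```

    Because the same profile function accepts characters and tag sequences, the two kinds of distance can be computed side by side and compared, which is the contrast the abstract draws.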

    ArchiMob : A multidialectal corpus of Swiss German spontaneous speech

    19. Arbeitstagung zur alemannischen Dialektologie, Freiburg, Germany
    Although dialect use is part of everyday life in German-speaking Switzerland, digital resources for dialectological and computational linguistic research are available only to a limited extent. In this contribution, we present a freely available multidialectal corpus of spontaneous Swiss German speech. It consists of transcriptions of video interviews with contemporary witnesses of the Second World War in Switzerland, which were recorded about 15 years ago as part of the ArchiMob project (http://www.archimob.ch). Each interview is conducted with a single informant and lasts between 1 and 2 hours. The first version, with 34 transcribed interviews (15,500 words per recording on average, 500,000 words in total), was published in 2016 (Samardžić et al. 2016); a second version with 9 additional interviews is planned for summer 2017. In the first part of the contribution, we describe how the documents were transcribed, segmented, and aligned with the audio data, how we attempt to unify the massive (dialectal, speaker-specific, and transcriber-specific) variation by means of an additional normalization layer, and how we make the data accessible with specifically adapted corpus tools. In the second part, we illustrate with several examples how this corpus can provide new answers to questions in computational linguistics and dialectology.
    Peer reviewed

    Continuous variation in computational morphology - the example of Swiss German

    Most work in natural language processing is geared towards written, standardized language varieties. This focus is generally justified on practical grounds of data availability and socio-economic relevance, but does not always reflect the linguistic reality of sub-standard varieties. In this paper, we aim at the computational description of the morphology of a language with continuous internal variation, as it is encountered in most dialect landscapes. The work presented here is applied to Swiss German dialects; these dialects are well documented through dialectological research and are among the most lively ones in Europe in terms of social acceptance and media exposure. Our work is inspired by previous research in generative dialectology and computational linguistics, which attempts to derive multiple dialect systems from a single reference system with the help of hand-written transformation rules. Such transformation rules may be called georeferenced, in the sense that they link to a set of geographic coordinates that can be grounded on a map. We improve on this work in several respects. First, our model associates all rules with probabilistic maps extracted from linguistic atlases. This allows us to handle transition zones in which several variants are accepted. Second, we provide a full implementation of this model on the basis of finite-state transducers. In addition to finite-state composition, which derives dialectal word forms by applying several rules in cascade, we propose a second type of composition, map composition, to compute the area of validity of the derived word forms on the basis of the probabilistic maps associated with the rules.
    In this paper, we will focus on two aspects of the proposed model: its theoretical value as a computationally effective description of continuous linguistic variation, and its practical value as a word-level machine translation system from Standard German into the various Swiss German dialects. We evaluate the model on the latter aspect